This year, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. Contrasting to popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited.
For example, I and my fellow classamate often go out to McDonald that are near our home for convenience, even though these restaurants were not healthy. Environment can affect our health behaviours imperceptibly.
This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.
Can the percentage of nation-wide chain and fast-food restaurants in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
Can the percentage of fast-food restaurant (defined by cuisine description in the restaurant inspection dataset) in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate ?
Can the percentage of health inspection grade A restaurant in different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
Can the composite restaurant health score of different neighborhood of NYC be related to their obesity, diabetes and stroke hospitalization rate?
For the first three questions, we stick to it. Since the percentage of health inspection grade A restaurant is not a significant predictor, we later put it into the model as a confounder. Moreover, we deleted the fourth problem. Composite restaurant health score were to subjective and difficult to assess. Moreover, we calculated the boro level model first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we still managed to get the neighborhood information and analyze the data accordingly.
1. NYC restaurant inspection:
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
This dataset contains the data for restaurant inspection in NYC from August 1, 2014 to December 5th, 2017. Every row is a restaurant inspection record that includes the name of the restaurant, zipcode, cuisine type description, inspection grade (A as the best grade) and so on. Since this dataset has the inspection data for 3 years, it is almost certained that it has the data for all the restuarants in NYC in 2014.
Variable used in this datasets
DBA: Name of the restaurant
BORO: name of the boro
ZIPCODE: the zipcode of the restaurant
CUISINE DESCRIPTION: the kind of food that the restaurant is providing
GRADE: Inspection grade for the specific inspection.
2. 2015 Community Health Profiles Open Data:
https://www1.nyc.gov/site/doh/data/data-publications/profiles.page
This dataset contains NYC every neighborhoods’ demographic (percentage of white race, poverty percentage), health (age-adjusted percent of adult exercised in the last 30 days, age-adjusted percent of adults as a smoker) and our main outcome (Age-adjusted percent of adults that is obese (BMI of 30 or gthe reater), ge-adjusted percent of adults, Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults)
Variable used in this datasets
Name: the name for the neigborhood.
Racewhite_Rate: the percentage for white race
Poverty: Percent of individuals living below the federal poverty threshold
Smoking: age-adjusted percent of adults as a smoker Exercise: Age-adjusted percent of adults that reported getting any exercise in the last 30 days
Obesity: Age-adjusted percent of adults that is obese (BMI of 30 or greater) based on self-reported height and weight
Diabetes: Age-adjusted percent of adults that had ever been told by a healthcare professional that they have diabetes
Stroke_Hosp: Age-adjusted rate of hospitalizations due to stroke (cerebrovascular disease) per 100,000 adults
3. Web scraping for what zipcodes each neighborhood contains
Because Community Health Profiles Open Data is has only neighborhood level information, so we have to aggregate neighborhood level information from the restaurant inspeciton data. However, the restaurant inspection data was using zipcodes for every restaurant. Therefore, we had to add the neighborhood name to every restaurant, so that we can group by neighborhood then calculate the percentage for every neighborhood.
First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations were different from the health profile data we downloaded. We cannot merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood was not divided by zipcode areas. Then we tried the hardest way: search the name of the neighborhood on Google and finding the matching zipcodes. It worked out well, but it was not precise, because neighborhoods were not divided according to the area of the zipcode. Different neighborhoods have the same zipcodes.
In order to be unbiased in this process, if two neighborhoods contains the same zipcode area, we will randomly assign this zipcode to only one neighborhood.
First, we download restaurant inspection data from NYC open data.
get_all_inspections = function(url) {
all_inspections = vector("list", length = 0)
loop_index = 1
chunk_size = 50000
DO_NEXT = TRUE
while (DO_NEXT) {
message("Getting data, page ", loop_index)
all_inspections[[loop_index]] =
GET(url,
query = list(`$order` = "zipcode",
`$limit` = chunk_size,
`$offset` = as.integer((loop_index - 1) * chunk_size)
)
) %>%
content("text") %>%
fromJSON() %>%
as_tibble()
DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
loop_index = loop_index + 1
}
all_inspections
}
url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"
rest_inspection = get_all_inspections(url) %>%
bind_rows()
Then, we download health and demographic data CSV into local folder, read in and clean it.
download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>%
select(Name, Racewhite_Rate, Poverty, Unemployment,
Smoking, Exercise,
Obesity, Diabetes, Stroke_Hosp) %>%
clean_names()
Lastly, we match the neighborhood names with restaurant zipcodes.
zip_neighbor <- read_csv("neigh_zipcode.csv") %>%
mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>%
filter(!is.na(neighborhood))
health_only_neighborhood <- health[-c(1:6),] %>%
rename(neighborhood = name) %>%
mutate(neighborhood = as.factor(neighborhood))
##Plotting for outcome in different neighborhood
bar_obe <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>%
filter(obesity > 25) %>%
ggplot(aes(x = neighborhood,y = obesity, fill = neighborhood)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 25 percent more obesity rate")
bar_dia <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>%
filter(diabetes > 10) %>%
ggplot(aes(x = neighborhood,y = diabetes, fill = neighborhood)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 10 percent more diabetes rate")
bar_stro <- health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>%
filter(stroke_hosp > 300) %>%
ggplot(aes(x = neighborhood,y = stroke_hosp, fill = neighborhood)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 300 more stroke hospitalization in 100,000 adults")
ggplotly(bar_obe)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(bar_dia)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
ggplotly(bar_stro)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
Williamsbridge and Baychester have the highest obesity rate, up to 32%. East New York and Starrett City have the highest diabetes rate, up to 17 percent. BUshwick has the highest stroke hospitalization in 100,000 adults, up to 300 adults. Morrisania and Crotona has the highest percent of poverty rate, up to 40%. Fiancial district has the highest percent of young peoeple, up to 67%.
Fastfood restaurants are identified by their cuisine descriptions given in the inspection data. We print out the cuisine type list (n=85) and let everyone circle the ones they think is fastfood and we use the union as our rule (use union because it’s more conservative).
We classify cuisine descriptions “Bagels/Pretzels”, “Bottled beverages, including water, sodas, juices, etc.”, “Chicken”, “Delicatessen”, “Donuts”, “Hamburgers”, “Hotdogs”, “Hotdogs/Pretzels”, “Ice Cream, Gelato, Yogurt, Ices”, “Nuts/Confectionary”, “Pancakes/Waffles”, “Pizza”, “Soul Food”, “Sandwiches”, “Sandwiches/Salads/Mixed Buffet” and “Soups & Sandwiches” as fastfood restaurants. And then we calculate the total number of restaurants…
# calculating the total number of restaurants and the number of fastfood restaurants in the neighborhood, as well as the percentage of fastfood restaurants.
neighborhood_list =
rest_neighborhood %>%
distinct(neighborhood) %>%
arrange(neighborhood)
rest_fastfood_neighborhood =
rest_neighborhood %>%
filter(cuisine_description %in% c("Bagels/Pretzels",
"Bottled beverages, including water, sodas, juices, etc.",
"Chicken",
"Delicatessen",
"Donuts",
"Hamburgers",
"Hotdogs",
"Hotdogs/Pretzels",
"Ice Cream, Gelato, Yogurt, Ices",
"Nuts/Confectionary",
"Pancakes/Waffles",
"Pizza",
"Soul Food",
"Sandwiches",
"Sandwiches/Salads/Mixed Buffet",
"Soups & Sandwiches"))
percent_fastfood_neighborhood = function(name_neighborhood){
rest_each_neighborhood =
rest_neighborhood %>%
filter(neighborhood == name_neighborhood) %>%
distinct(camis)
n_rest_neighborhood = nrow(rest_each_neighborhood)
rest_fastfood_distinct_neighborhood =
rest_fastfood_neighborhood %>%
filter(neighborhood == name_neighborhood) %>%
distinct(camis, cuisine_description)
n_fastfood_neighborhood = nrow(rest_fastfood_distinct_neighborhood)
percent_fastfood_neighborhood = n_fastfood_neighborhood/n_rest_neighborhood
tibble(
neighborhood = name_neighborhood,
n_fastfood = n_fastfood_neighborhood,
n_rest = n_rest_neighborhood,
percent_fastfood = percent_fastfood_neighborhood
)
}
fastfood_neighborhood =
map(neighborhood_list$neighborhood, percent_fastfood_neighborhood) %>%
bind_rows() %>%
mutate(neighborhood = str_to_upper(neighborhood))
fastfood_neighborhood %>%
mutate(neighborhood = as.factor(neighborhood),
n_rest = as.numeric(n_rest),
n_nonfastfood = n_rest - n_fastfood,
neighborhood = fct_reorder(neighborhood, percent_fastfood)) %>%
plot_ly(., x = ~neighborhood, y = ~n_fastfood, type = 'bar', name = 'fastfood restaurants') %>%
add_trace(y = ~n_nonfastfood, name = 'non-fastfood restaurants') %>%
layout(yaxis = list(title = 'number of restaurants'),
xaxis = list(title = 'neighborhood (ordered by percentage of fastfood restaurants)',
tickangle = 45),
barmode = 'stack')
From the plot, we can see that, while the Greenwich Village and SOHO neighborhood has fairly large number of restaurants, it has the smallest percentage of fastfood restaurants. Williamsbridge and Baychester has the largest percentage of fastfood restaurants.
When large number of total restaurants is not equal to large percentage of fastfood restaurants in that neighborhood, we can conclude that the distribution of fastfood restaurants is not even across neighborhoods, which also implies the motivation of our study, we want to investigate if this uneven distribution of fastfood restaurants is associated with diffferent level of chronic disease outcomes within a neighborhood.
We scrape list of 75 national chain restaurants from the wikipedia page (https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual) now I will join this dataset with restaurant inspection data to choose only the chain restaurants in NYC.
chains_html = read_html("https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual")
chain_rest = chains_html %>%
html_nodes("ul:nth-child(9) li , .jquery-tablesorter tr:nth-child(1) td:nth-child(1)") %>%
html_text() %>%
as.tibble() %>%
mutate(dba = value,
dba = str_to_upper(dba)) %>%
select(dba)
# read in the list of chain restaurants in us
# made the names to uppercase and changed the var name to dba
head(chain_rest, 10)
## # A tibble: 10 x 1
## dba
## <chr>
## 1 A&W RESTAURANTS
## 2 APPLEBEE'S
## 3 BAJA FRESH
## 4 BOSTON MARKET
## 5 BUFFALO WILD WINGS
## 6 CHILI'S
## 7 CHIPOTLE MEXICAN GRILL
## 8 CICI'S PIZZA
## 9 COLD STONE CREAMERY
## 10 CORNER BAKERY CAFE
Then, we match list of chain restaurants in U.S. and restaurant inspection data.
# removing punctuations in chain_rest & restaurant inspections (neighborhoods)
chain_rest_str =
chain_rest %>%
mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))
rest_neigh_str = rest_neighborhood %>%
mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))
# Matching the two datasets(restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
neighborhood_chain =
right_join(rest_neigh_str, chain_rest_str) %>%
filter(!is.na(camis)) %>%
distinct(camis, dba, neighborhood, boro)
## Joining, by = "dba"
neighborhood_chain %>%
group_by(dba) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## dba n
## <chr> <int>
## 1 DUNKIN DONUTS 453
## 2 SUBWAY 364
## 3 STARBUCKS 307
## 4 MCDONALDS 216
## 5 CHIPOTLE MEXICAN GRILL 78
## 6 WENDYS 46
## 7 APPLEBEES 28
## 8 BOSTON MARKET 24
## 9 WHITE CASTLE 23
## 10 PIZZA HUT 21
The combined dataframe, neighborhood_chain has 1739 observations. Also, there were 33 different chain restaurants extracted. The chart is showing the 10 chain restaurants in descending order.
# counting chains in neighborhoods
neigh_count_chain = neighborhood_chain %>%
group_by(neighborhood, boro) %>%
summarise(chain_n = n())
neigh_count_rest = rest_neighborhood %>%
distinct(neighborhood, camis) %>%
group_by(neighborhood) %>%
summarise(res_n = n())
# calculating percentage of chains in each boro
percent_neighborhood_chain = left_join(neigh_count_chain, neigh_count_rest) %>%
ungroup() %>%
mutate(chain_percentage = chain_n/res_n,
neighborhood = str_to_upper(neighborhood))
## Joining, by = "neighborhood"
plot_chain_neighbor = percent_neighborhood_chain %>%
mutate(neighborhood = forcats::fct_reorder(neighborhood, chain_percentage)) %>%
ggplot(aes(neighborhood, chain_percentage, fill = boro)) + geom_bar(stat="identity")
ggplotly(plot_chain_neighbor)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
We have plotted neighborhoods of NYC against their percentage of chain restaurants and grouped them by borough. We see that all the neighborhodds that had a percentage of chain restuarants less than 5% were all part of Brooklyn except Greenwhich village and Soho of Manhattan. Neighborhoods in Queens and Manhattan are spread out across low to high percentage of chain restaurants while most of neighborhoods in Bronx and Staten Island have high percentages. The neighbor hood that had the highest percentage of chain restaurants were Throgs neck and Co-op City of Bronx with 16.5% of chain restaurats out of all restaurats.
gradea_neighborhood =
rest_neighborhood %>%
group_by(neighborhood, grade) %>%
summarise(n = n()) %>%
mutate(grade_percent = n / sum(n)) %>%
filter(grade == "A") %>%
ungroup(boro) %>%
mutate(neighborhood = str_to_upper(neighborhood))
gradea_neighborhood %>%
mutate(neighborhood = as.factor(neighborhood),
neighborhood = fct_reorder(neighborhood,grade_percent)) %>%
plot_ly(x = ~neighborhood, y = ~grade_percent, color = ~neighborhood, type = "bar") %>%
layout(showlegend = FALSE)
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
The difference of percentage of “grade-A” restaurants in each neighborhood is observed. “Throgs Neck and Co-op City” has the greatest grade-A restaurant percentage, around 44.50%. “Sunset Park”, however, has the least, around 30.62%. “Washington Heights”, where we live, takes the fourth from the end, 34.39%, which is obviously consistent with the feeling we have towards the restaurant condition of “Washington Heights”.
health_neighborhood =
health %>%
mutate(neighborhood = str_to_upper(name)) %>%
select(-name)
combined_chain =
percent_neighborhood_chain %>%
select(neighborhood, chain_percentage)
combined_chain_fastfood =
fastfood_neighborhood %>%
mutate(fastfood_percent = percent_fastfood) %>%
select(neighborhood, fastfood_percent) %>%
right_join(combined_chain, by = "neighborhood")
combined_chain_fastfood_gradea =
gradea_neighborhood %>%
select(neighborhood, grade_percent) %>%
right_join(combined_chain_fastfood, by = "neighborhood")
combined_model =
left_join(combined_chain_fastfood_gradea, health_neighborhood, by = "neighborhood")
We originally have two main predictors of interest, percentage of chain restaurants and percentage of fastfood restaurants, and they are both significantly associated with the three chronic disease health oucomes when solely in the model after adjusting for other potential confounders. Taking obesity as example, the p-value for fastfood_percent is 0.0006801903 and for chain_percentage is 0.012419514.
outcome_name = combined_model[,10:12]
main_predictor_selection = function(outcome){
lm_fastfood =
lm(outcome ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy()
lm_chain =
lm(outcome ~ chain_percentage + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy()
lm_both =
lm(outcome ~ fastfood_percent + chain_percentage + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy()
lm_compare = bind_rows(lm_fastfood, lm_chain, lm_both)
}
map(outcome_name, main_predictor_selection)
## $obesity
## term estimate std.error statistic p.value
## 1 (Intercept) 31.92112433 15.39841618 2.0730135 0.0431432532
## 2 fastfood_percent 59.42872810 16.44420636 3.6139615 0.0006801903
## 3 grade_percent -28.28758943 27.92319074 -1.0130500 0.3157269432
## 4 racewhite_rate -0.09256862 0.04288468 -2.1585476 0.0355225376
## 5 poverty 0.05144636 0.09563734 0.5379318 0.5929191610
## 6 smoking 0.76326783 0.24626934 3.0993214 0.0031276157
## 7 exercise -0.23249722 0.14391114 -1.6155610 0.1122411503
## 8 (Intercept) 26.92389210 16.12679737 1.6695126 0.1010260078
## 9 chain_percentage 68.24399981 26.34821588 2.5900805 0.0124195142
## 10 grade_percent -6.19002972 27.74920917 -0.2230705 0.8243546127
## 11 racewhite_rate -0.13087817 0.04186736 -3.1260190 0.0028985047
## 12 poverty 0.13579730 0.10658678 1.2740539 0.2083092151
## 13 smoking 0.83024365 0.25754271 3.2237124 0.0021876241
## 14 exercise -0.22043017 0.15328119 -1.4380771 0.1564035643
## 15 (Intercept) 31.44610974 15.55231111 2.0219573 0.0484418351
## 16 fastfood_percent 52.82345911 22.03386102 2.3973764 0.0202135830
## 17 chain_percentage 15.25392784 33.53667726 0.4548431 0.6511523758
## 18 grade_percent -29.75987008 28.32416882 -1.0506882 0.2983551020
## 19 racewhite_rate -0.09200984 0.04323297 -2.1282332 0.0381710626
## 20 poverty 0.07102154 0.10554794 0.6728842 0.5040586289
## 21 smoking 0.76030442 0.24825469 3.0625984 0.0034987282
## 22 exercise -0.22233682 0.14673168 -1.5152612 0.1358801114
##
## $diabetes
## term estimate std.error statistic p.value
## 1 (Intercept) 17.17647608 6.47822240 2.6514181 1.059925e-02
## 2 fastfood_percent 16.91910979 6.91819372 2.4455964 1.788278e-02
## 3 grade_percent 1.69749345 11.74748349 0.1444985 8.856655e-01
## 4 racewhite_rate -0.07874514 0.01804189 -4.3645736 6.073260e-05
## 5 poverty 0.04837026 0.04023530 1.2021846 2.347381e-01
## 6 smoking 0.15236099 0.10360725 1.4705630 1.474349e-01
## 7 exercise -0.14495032 0.06054443 -2.3941147 2.030169e-02
## 8 (Intercept) 15.74127215 6.62715596 2.3752681 2.125819e-02
## 9 chain_percentage 18.23537394 10.82755192 1.6841641 9.814418e-02
## 10 grade_percent 8.59989177 11.40327696 0.7541597 4.541551e-01
## 11 racewhite_rate -0.09040128 0.01720500 -5.2543608 2.811443e-06
## 12 poverty 0.07079734 0.04380084 1.6163468 1.120708e-01
## 13 smoking 0.17304831 0.10583476 1.6350801 1.080722e-01
## 14 exercise -0.14288729 0.06298947 -2.2684315 2.748259e-02
## 15 (Intercept) 17.10694576 6.55462106 2.6099061 1.185888e-02
## 16 fastfood_percent 15.95226281 9.28631176 1.7178255 9.189529e-02
## 17 chain_percentage 2.23279537 14.13424728 0.1579706 8.751038e-01
## 18 grade_percent 1.48198821 11.93740224 0.1241466 9.016872e-01
## 19 racewhite_rate -0.07866335 0.01822081 -4.3172254 7.285939e-05
## 20 poverty 0.05123558 0.04448385 1.1517793 2.547826e-01
## 21 smoking 0.15192723 0.10462853 1.4520631 1.526087e-01
## 22 exercise -0.14346309 0.06184101 -2.3198699 2.438747e-02
##
## $stroke_hosp
## term estimate std.error statistic p.value
## 1 (Intercept) 128.7330288 159.5106855 0.80704956 4.233141e-01
## 2 fastfood_percent 573.5406491 170.3439236 3.36695690 1.436183e-03
## 3 grade_percent -414.8614240 289.2535987 -1.43424810 1.574881e-01
## 4 racewhite_rate -1.3739246 0.4442382 -3.09276575 3.186416e-03
## 5 poverty 1.5643605 0.9906978 1.57904910 1.203893e-01
## 6 smoking 6.6038469 2.5510800 2.58864751 1.246526e-02
## 7 exercise 1.9095755 1.4907614 1.28093973 2.058984e-01
## 8 (Intercept) 76.3103399 173.7569200 0.43917871 6.623517e-01
## 9 chain_percentage 258.5291531 283.8867962 0.91067692 3.666679e-01
## 10 grade_percent 3.3420325 298.9816890 0.01117805 9.911242e-01
## 11 racewhite_rate -1.9949362 0.4510966 -4.42241412 5.003035e-05
## 12 poverty 1.8462919 1.1484110 1.60769264 1.139582e-01
## 13 smoking 7.7931853 2.7748739 2.80848273 6.994899e-03
## 14 exercise 1.5659342 1.6515163 0.94817972 3.474238e-01
## 15 (Intercept) 146.1759388 157.2801408 0.92939858 3.570609e-01
## 16 fastfood_percent 816.0913302 222.8278961 3.66242892 5.940958e-04
## 17 chain_percentage -560.1362476 339.1555946 -1.65156128 1.047690e-01
## 18 grade_percent -360.7981184 286.4416246 -1.25958690 2.135524e-01
## 19 racewhite_rate -1.3944434 0.4372140 -3.18938431 2.438602e-03
## 20 poverty 0.8455446 1.0674037 0.79215066 4.319421e-01
## 21 smoking 6.7126655 2.5105936 2.67373639 1.005194e-02
## 22 exercise 1.5364776 1.4838939 1.03543631 3.053501e-01
Since we also anticipated them to be highly associated with each other, to avoid collinearity, we will need to make decision on which one to keep as the final main predictor. So we put the two predictors in the same model and see how their p-values change. As a result, the fastfood_percent stays significant (p=0.020213583) and the chain_percentage turn insignificant (p=0.651152376).
Thus, we keep percentage of fastfood restaurants (fastfood_percent) as the main predictor.
outcome_name = combined_model[,10:12]
confounder_percent_change = function(outcome){
lm_adjusted =
lm(outcome ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
summary()
lm_adjusted_tbl =
as.tibble(lm_adjusted[[4]]) %>%
clean_names()
lm_crude =
lm(outcome ~ fastfood_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
summary()
lm_crude_tbl =
as.tibble(lm_crude[[4]]) %>%
clean_names()
percent_change = (lm_crude_tbl$estimate[2] - lm_adjusted_tbl$estimate[2]) /lm_crude_tbl$estimate[2]
confounder_percent_change = percent_change
}
map(outcome_name, confounder_percent_change)
## $obesity
## [1] -0.1883358
##
## $diabetes
## [1] 0.03232614
##
## $stroke_hosp
## [1] -0.3172496
Besides some biologically meaningful covariates (i.e. race, poverty, smoking status, exercise), we also hypothesize the variable percentage of grade A restuarants as a potential confounder in the association between fastfood restaurants percentage and the three chronic disease health outcomes.
Here, we are assessing if grade_percent is a significant confounder statistically. Using the 10% change rule of thumb, we find that after adjusting for percentage of grade A restuarants, the estimates of fastfood_percent change by 18.83% for outcome obsesity, 3.23% for outcome diabetes and 31.72% for outcome stroke hospitalization.
Thus, regarding the final models, we are going to keep grade_percent for obsesity and stroke hospitalization, but take it out and rerun the model for the outcome diabetes.
Model 1: Obesity = fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise
lm(obesity ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 31.9211243 | 15.3984162 | 2.0730135 | 0.0431433 |
| fastfood_percent | 59.4287281 | 16.4442064 | 3.6139615 | 0.0006802 |
| grade_percent | -28.2875894 | 27.9231907 | -1.0130500 | 0.3157269 |
| racewhite_rate | -0.0925686 | 0.0428847 | -2.1585476 | 0.0355225 |
| poverty | 0.0514464 | 0.0956373 | 0.5379318 | 0.5929192 |
| smoking | 0.7632678 | 0.2462693 | 3.0993214 | 0.0031276 |
| exercise | -0.2324972 | 0.1439111 | -1.6155610 | 0.1122412 |
Model 2: Diabetes = fastfood_percent + racewhite_rate + poverty + smoking + exercise
lm(diabetes ~ fastfood_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 17.6786699 | 5.4163283 | 3.263958 | 0.0019270 |
| fastfood_percent | 17.4843100 | 5.6533466 | 3.092736 | 0.0031611 |
| racewhite_rate | -0.0778464 | 0.0167787 | -4.639603 | 0.0000233 |
| poverty | 0.0469398 | 0.0386366 | 1.214906 | 0.2297869 |
| smoking | 0.1529457 | 0.1025675 | 1.491172 | 0.1418447 |
| exercise | -0.1443873 | 0.0598582 | -2.412154 | 0.0193531 |
Model 3: Stroke_hosp = fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise
lm(stroke_hosp ~ fastfood_percent + grade_percent + racewhite_rate + poverty + smoking + exercise, combined_model) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 128.733029 | 159.5106855 | 0.8070496 | 0.4233141 |
| fastfood_percent | 573.540649 | 170.3439236 | 3.3669569 | 0.0014362 |
| grade_percent | -414.861424 | 289.2535987 | -1.4342481 | 0.1574881 |
| racewhite_rate | -1.373925 | 0.4442382 | -3.0927657 | 0.0031864 |
| poverty | 1.564361 | 0.9906978 | 1.5790491 | 0.1203893 |
| smoking | 6.603847 | 2.5510800 | 2.5886475 | 0.0124653 |
| exercise | 1.909576 | 1.4907614 | 1.2809397 | 0.2058984 |
We have three final models each having a health outcome (prevalence of obesity, prevalence of diabetes, and stroke hospitalization rate) as the dependent variable and percentage of fastfood as the main predictor. Obesity and percentage of fastfood restuarants model gave the most significant results with p-value of the main predictor being 0.0005364 (<0.01). In other words, at significance level of 1%, every 0.1 percent increase in fastfood restaurants in a neighborhood is associated with 5% increase in the neighborhood’s obesity prevalence, while adjusting for other factors.
The p-value of the regression coefficients for model with outcomes as diabetes/stroke were both arround 0.003 (<0.01), indicating there is a significant linear relationship between diabetese/stroke and percentage of fastfood restaurants at 1% significance level. To put in other words, at significance level of 1%, every 1 percent increase in fastfood restaurants in a neighborhood is associated with 17.5% increase in the neighborhood’s diabetes prevalence, while adjusting for other factors. Furthermore, every 1 percent increase in fastfood restaurants in a neighborhood is associated with increase in 573.5 stroke hospitalizations per 100,000 adults in the neighborhood, while adjusting for other factors.
It is reasonable that among the three health outcomes, obesity has the strongest linear association with food environment, in this case, percentage of fastfood restaurants. It is a health condition that has the most direct relation with one’s diet. Moreover, it has a higher prevalence than diabetes and stroke, which could lead to having a lower p-value than the other two health conditions.
Our 2015 Community Health Profiles Open Data has data on obesity, diabetes prevalence rate from 2013 and stroke hospitalization rate from 2012. And the data of geographic distribution of restaurants is from 2014 to 2017. So the health profile data is somewhat earlier than the restaurant geographic distribution. However, as restaurant distribution and chronic disease prevalence rate wouldn’t change much over a few years, the lag between years don’t actually affect our results that much. Therefore, our results can still be valid.
Although the cuisine type of a restaurant is typically assumed to directly reflect the health level of the food the restaurant provides, we demonstrated here that even the unhealthiest restaurant can offer one or several healthy food, which may bring bias to the research. This was not a low-probability event, it should have required more efficient planning to resolve. Besides, the study at first treats chain restaurants as a cutoff point for unhealthy restaurants. Obviously this cutoff point is ambiguous and farfetched. In the end, the study doesn’t include this as a predictor.
Lastly, although there is an association between fast-food restaurant distribution and health outcomes, we could not conclude that it is fast-food restaurant causing these health outcomes. Maybe it is because the citizens living in the neighborhood are obese indicating that they like to eat fast food. This will actually result in more fast-food restaurant opening in this neighborhood since the business will be good. Further study need to be conducted before concluding any causation effect.
Overall, we conclude that there is a significant relationship between obesity/diabetes/stroke and the geographical distribution of fast-food restaurants.